I've played around a bit with Markov Chains and have found them to be a fun and simple way to generate sentences from existing data. A problem I've had was to actually aquire such data for generating the markov chain. I've used The Entire Bee Movie Script and the updated version of the Gutenberg dataset as an input, but none of there really satisfied me. The sentences end up just being weird and don't really make sense in context.
That was the moment I realized I needed data, lots of data and it hat to be somehow realistic. The end goal was to build a chatbot you could chat with and get okish responses. Obtaining data is hard, due to throwing all moral understanding overboard being bad habit. The data we need ist the data produced in private chats, so I tried inserting my telegram history into the markov chain, but the result was kind of underwhelming. I didn't really find the right settings for the sentences to be appropriate and it just isn't enough data.
Then I had the idea of building a bot people could talk to and use that conversation as data for the markov chain. Problem: this presupposes that such a bot already exists: a contradiction. We can solve this problem by completly eliminating the bot:
I built a small bot that just functions as a gateway connecting two people. The people think that they are talking to a bot, but in the end, they're talking to each other. This led to really interesting conversations and I kind of felt as if I was watching a turing test (I was, but it was kind of surreal).
There were more learnings than "writing a telegram bot is fairly easy": Watching how people talk with each other and unknowingly try to test the limits of bots is super interesting and entertaining at the same time.
"Ja und bin überrascht wie gut. Mache hauptsächlich Definitionsfragen und kann mir vorstellen, dass da paar klügere Search-Engine-Queries hinterhängen, aber ich konnte auf Englisch wechseln".
"Yes and I'm surprised how good. I'm mostly asking definition questions and can imagine there are some smarter search engine queries in the backend, but I was able to switch to English"
The complete code is a bit more than 100 lines of pure code, I've added a lot of comments for making it easier to understand, even for beginners.
Beware: this code was written down in a matter of minutes, there are a lot of things wrong with it (such as everything being in the main function), but it works for a minimal prototype which was the goal of all of this.